speech: persistent VkPipelineCache + qvac-speech-ggml hybrid packaging by GustavoA1604 · Pull Request #7 · tetherto/qvac-ext-ggml

GustavoA1604 · 2026-05-07T11:39:23Z

Summary

Sync the speech branch with the chatterbox.cpp Vulkan optimizations and finalize the qvac-speech-ggml-* packaging convention agreed in #qvac (per-addon ggml prefix; speech-branch fork is the speech build).

Five commits, +343/-40, covering:

Per-consumer ggml library filename prefix wired through both the build and the runtime DL loader (GGML_LIB_OUTPUT_PREFIX → GGML_BACKEND_DL_PROJECT_PREFIX).
Optional persistent VkPipelineCache opt-in via GGML_VK_PIPELINE_CACHE_DIR, plus a crash-safe eager flush, recovering ~91% of the cold→warm shader-compile gap on drivers without an aggressive per-app system cache (Mesa/RADV, Android Adreno/Mali, fresh NVIDIA installs, containers). Tracks JIRA QVAC-17872.
Hybrid backend packaging cherry-pick from the 2026-01-30 branch, adapted from qvac-diffusion- to qvac-speech-. Lets the speech ggml ship as a static CPU core with MODULE GPU backends so an .aar/.apk/loose-DLL drop can dlopen Vulkan/OpenCL/CUDA at runtime without forcing the whole core dynamic.
Windows-correct pipeline-cache rename: switches both flush sites from C std::rename to std::filesystem::rename, which has POSIX overwrite semantics on Windows. Prevents the .pcache blob from getting frozen at its first-write size on the second and later saves on Windows.

Commits

build: add GGML_LIB_OUTPUT_PREFIX option for per-consumer lib filename prefix
Adds the GGML_LIB_OUTPUT_PREFIX cache var. When set, ggml libs land on disk as lib<prefix>ggml-*.{a,so,dll} and target_compile_definitions(ggml-base PRIVATE GGML_BACKEND_DL_PROJECT_PREFIX="<prefix>") is propagated so the runtime loader looks for the same names. CMake target names and the find_package(ggml CONFIG) package name are intentionally unchanged.
vulkan: persistent VkPipelineCache, explicit opt-in via GGML_VK_PIPELINE_CACHE_DIR (QVAC-17872 round 1)
Opt-in only; behaviour is byte-identical to upstream when GGML_VK_PIPELINE_CACHE_DIR is unset/empty. Cache file keyed on vendorID/deviceID/driverVersion. Save happens from ggml_vk_cleanup() (not ~vk_device_struct, which is unreliable at process exit because pipelines hold shared_ptr<vk_device_struct> ref cycles). Atomic save via tmp + rename.
vulkan: crash-safe eager pipeline-cache flush (QVAC-17872 round 2)
Flushes after every ggml_vk_load_shaders compile batch when the cache grew, so a process killed mid-graph doesn't lose freshly compiled pipelines. pipeline_cache_last_size book-keeping short-circuits both the eager flush and the cleanup-time flush on warm runs (cache-hit only). Without the short-circuit, the unconditional flush regressed warm-run wall time by ~90 ms on the chatterbox.cpp benchmark.
cmake: support qvac hybrid backend packaging (cherry-pick from 2026-01-30)
Cherry-pick of 512e1773 from the 2026-01-30 branch with the prefix swapped from qvac-diffusion- to qvac-speech-. Adds GGML_CPU_STATIC (CPU stays in the core .a, only GPU backends become MODULE shared libs), drops the GGML_BACKEND_DL requires BUILD_SHARED_LIBS check, makes ggml/ggml-base PIC under GGML_BACKEND_DL, and adds the Android filename-only dlopen fallback for flattened native dirs. Also pulls in the upstream-style unique_ptr conversion in ggml_backend_vk_reg_get_device (memory-leak fix; see design notes).
vulkan: use std::filesystem for pipeline-cache path/rename (Windows-correct overwrite)
Replaces the C std::rename / std::remove with std::filesystem::rename / std::filesystem::remove at both flush sites and builds pipeline_cache_path via std::filesystem::path joining. C rename on Windows fails if the destination exists, which meant the second-and-later flushes silently dropped on Windows.
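For reviewers who want the tmp + rename pattern in one place, here is a minimal sketch. It is not the code in this PR: `save_blob_atomically` and the `.tmp` suffix are hypothetical names chosen for illustration. It only shows the shape described above, which is to write the blob to a sibling temp file and then call `std::filesystem::rename` so an existing destination is replaced.

```cpp
#include <cassert>
#include <filesystem>
#include <fstream>
#include <system_error>
#include <vector>

// Hypothetical helper (not the PR's function): write `blob` to "<path>.tmp",
// then rename the temp file over `path`. Returns false on any failure.
static bool save_blob_atomically(const std::filesystem::path & path,
                                 const std::vector<unsigned char> & blob) {
    namespace fs = std::filesystem;
    assert(!path.empty());

    fs::path tmp = path;
    tmp += ".tmp";  // assumed suffix, for illustration only

    std::error_code ec;
    {
        std::ofstream out(tmp, std::ios::binary | std::ios::trunc);
        out.write(reinterpret_cast<const char *>(blob.data()),
                  static_cast<std::streamsize>(blob.size()));
        out.close();
        if (!out) {             // open, write, or close failed
            fs::remove(tmp, ec);
            return false;
        }
    }

    // std::filesystem::rename replaces an existing regular file at `path`.
    fs::rename(tmp, path, ec);
    if (ec) {
        fs::remove(tmp, ec);
        return false;
    }
    return true;
}
```

Calling the helper twice on the same path with blobs of different sizes should leave the file at the second size, which is the behaviour the Windows fix is meant to restore.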

Design notes (preempting common review questions)

These are deliberate choices that look unusual at first glance — calling them out so a re-review doesn't re-litigate them.

Why is `qvac-speech-ggml-` hardcoded as the no-macro fallback in `backend_filename_prefix()`?

Per the team agreement in #qvac (Gianfranco / Juan A.), each *.cpp addon that vendors its own ggml fork carries its own filename prefix to avoid dlopen filename collisions when multiple ggml versions coexist in one process / .aar / .apk:

fabric/ggml → libqvac-ggml-*
whispercpp/ggml (this branch) → libqvac-speech-ggml-*
diffusion/ggml → libqvac-diffusion-ggml-*

The speech branch is not meant to be used as a generic upstream-equivalent ggml — it is the speech build. Hardcoding the qvac-speech-ggml- filename in the no-macro fallback closes a real footgun: a downstream that builds the speech branch with -DGGML_LIB_OUTPUT_PREFIX= (empty) but doesn't define GGML_BACKEND_DL_PROJECT_PREFIX would otherwise produce libggml-*.so files but a loader hunting for libqvac-speech-ggml-*.so. Aligning both defaults to qvac-speech- makes the branch internally consistent.

The GGML_BACKEND_DL_PROJECT_PREFIX macro path is preserved so any downstream that does want to override it (e.g. a future addon vendoring this branch under a different prefix) still can.
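The fallback described above can be sketched roughly as follows. This is an illustration under assumptions, not the function body from this branch: it assumes a non-Windows target (leading `lib`, `.so` suffix), and `backend_library_filename` is a hypothetical helper added only to show the resulting file name.

```cpp
#include <cassert>
#include <string>

// Sketch of the prefix selection: use the build-injected macro when present,
// otherwise fall back to the speech-branch default.
static std::string backend_filename_prefix() {
#ifdef GGML_BACKEND_DL_PROJECT_PREFIX
    // Override path: the build defines the macro as a string literal,
    // e.g. "qvac-speech-".
    return std::string("lib") + GGML_BACKEND_DL_PROJECT_PREFIX + "ggml-";
#else
    // No-macro fallback, hardcoded to the speech prefix.
    return "libqvac-speech-ggml-";
#endif
}

// Hypothetical helper: full file name the loader would look for.
static std::string backend_library_filename(const std::string & backend) {
    assert(!backend.empty());
    return backend_filename_prefix() + backend + ".so";
}
```

With the macro undefined, `backend_library_filename("vulkan")` yields `libqvac-speech-ggml-vulkan.so`, the name the Android test case below expects to find.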

Why isn't `GGML_LIB_OUTPUT_PREFIX` baked into `ggml-config.cmake.in`?

Intentional — find_package(ggml CONFIG) consumers set it on their own side before find_package. The find_library(NAMES "${GGML_LIB_OUTPUT_PREFIX}ggml" ggml ...) form gives a clean fallback to the bare name for unprefixed builds. We preferred that over @PACKAGE_GGML_LIB_OUTPUT_PREFIX@ substitution because:

The consumer side already needs to know the prefix to satisfy other constraints (e.g. setting GGML_BACKEND_DL_PROJECT_PREFIX for a custom build, naming their own portfile artefacts). Putting one half of the contract in the package config and the other on the consumer side fragmented the convention.
vcpkg consumers of this branch declare the prefix explicitly via qvac-speech-ggml-style port names, so the package config never sees a "wrong" prefix in practice.
The GGML_MAX_NAME pass-through in ggml-config.cmake.in follows the same opt-in shape — same rationale, kept symmetric.

Why is the unique_ptr refactor in `ggml_backend_vk_reg_get_device` bundled in the cmake commit?

It's a clean cherry-pick of the same upstream change from the 2026-01-30 branch and was part of 512e1773, not a separate commit there. The new code is a real memory-leak fix: the previous devices.push_back(new ggml_backend_device {...}) and new ggml_backend_vk_device_context raw allocations were never freed (the static std::vector<ggml_backend_dev_t> held only raw pointers and is never torn down). Splitting it out from this PR would mean a follow-up cherry-pick that diverges the file from 2026-01-30 for no benefit. Calling it out here so it's not lost.
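To make the ownership change concrete, here is a generic sketch of the pattern. All type and member names (`demo_device`, `demo_device_context`, `demo_device_registry`) are hypothetical stand-ins, not the ggml types, and the `live` counter exists only to make the lifetime visible.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a per-device context.
struct demo_device_context {
    static inline int live = 0;  // number of live contexts, illustration only
    std::string name;
    explicit demo_device_context(std::string n) : name(std::move(n)) { ++live; }
    demo_device_context(const demo_device_context &) = delete;
    demo_device_context & operator=(const demo_device_context &) = delete;
    ~demo_device_context() { --live; }
};

// Hypothetical stand-in for a device that owns its context.
struct demo_device {
    std::unique_ptr<demo_device_context> context;
};

// The registry owns the devices; callers only ever see raw, non-owning pointers.
struct demo_device_registry {
    std::vector<std::unique_ptr<demo_device>> devices;

    demo_device * add(const std::string & name) {
        auto dev = std::make_unique<demo_device>();
        dev->context = std::make_unique<demo_device_context>(name);
        devices.push_back(std::move(dev));
        return devices.back().get();
    }
};
```

When the registry is destroyed, every device and context is released with it, whereas a vector of raw pointers filled with `new` would free nothing.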

Why is the OpenCL `qvac-parakeet patch:` comment removed?

The comment described why ggml_backend_opencl_init returns nullptr instead of asserting on a zero-device list — but the behavior (the null-device guard) is preserved. The cherry-pick from 2026-01-30 additionally hardens ggml_backend_opencl_reg_device_get to also return nullptr when no devices exist, instead of asserting; with the guard now in two places, the original single-site comment was stale. The functional contract (ggml_backend_opencl_init may return nullptr and callers must fall back to CPU) is unchanged.

Why are the two pipeline-cache flush paths near-duplicates?

The cleanup-time flush in ggml_vk_save_pipeline_cache and the eager flush in ggml_vk_load_shaders look similar but differ in one important way: the eager path requires growth (blob.size() > pipeline_cache_last_size) before writing, whereas the cleanup path writes when size differs at all. Folding them into a single save(require_growth) helper saves ~10 lines but couples two call sites with subtly different correctness invariants. Left as-is for now; happy to refactor in a follow-up if a third call site shows up.
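The difference between the two predicates is small enough to show directly. The function names below are hypothetical; only the comparisons come from the description above.

```cpp
#include <cassert>
#include <cstddef>

// Eager path (after a shader compile batch): write only when the blob grew.
static bool should_flush_eager(std::size_t blob_size, std::size_t last_size) {
    return blob_size > last_size;
}

// Cleanup path: write whenever the size differs from the last flushed size.
static bool should_flush_on_cleanup(std::size_t blob_size, std::size_t last_size) {
    return blob_size != last_size;
}
```

The two agree on a warm run (equal sizes, no write) and disagree only when the blob is smaller than the last flushed size, where the cleanup path still writes and the eager path does not.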

Why is `getenv("GGML_VK_PIPELINE_CACHE_DIR")` safe?

Called once per device init, in ggml_vk_get_device, before any worker threads exist. The result is captured into device->pipeline_cache_path and never re-read. No thread-safety concern.
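A sketch of the read-once pattern, under assumptions: the helper names and the on-disk file-name layout are hypothetical (the PR only states that the file is keyed on vendorID/deviceID/driverVersion), and the real code stores the result in device->pipeline_cache_path instead of returning it.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <cstdlib>
#include <filesystem>

// Hypothetical: build the cache file path from a directory and the device key.
// An unset or empty directory returns an empty path, meaning "cache disabled".
static std::filesystem::path make_pipeline_cache_path(const char * dir,
                                                      uint32_t vendor_id,
                                                      uint32_t device_id,
                                                      uint32_t driver_version) {
    if (dir == nullptr || dir[0] == '\0') {
        return {};
    }
    char fname[64];
    // Assumed file-name layout, for illustration only.
    std::snprintf(fname, sizeof(fname), "ggml-vk-%08x-%08x-%08x.pcache",
                  static_cast<unsigned>(vendor_id),
                  static_cast<unsigned>(device_id),
                  static_cast<unsigned>(driver_version));
    return std::filesystem::path(dir) / fname;
}

// Reads the environment once; the caller keeps the result and never re-reads it.
static std::filesystem::path pipeline_cache_path_from_env(uint32_t vendor_id,
                                                          uint32_t device_id,
                                                          uint32_t driver_version) {
    return make_pipeline_cache_path(std::getenv("GGML_VK_PIPELINE_CACHE_DIR"),
                                    vendor_id, device_id, driver_version);
}
```

Splitting the pure path builder from the `getenv` call keeps the environment read at a single site and makes the path logic testable without touching the process environment.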

Why no auto-discovery of `$XDG_CACHE_HOME` / `$HOME`?

ggml is a library distributed through package managers (vcpkg) and consumed by applications that should decide whether and where to persist Vulkan artefacts. Writing to the user's home directory without being asked is a side effect library consumers cannot see from the API surface. Apps that want default-on caching can set the env var in their bootstrap.
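For an application that does want default-on caching, the bootstrap step could look like the sketch below. `enable_vk_pipeline_cache` is a hypothetical application-side helper, not part of ggml, and choosing the directory is left to the application. It has to run before the first Vulkan device is initialised, since the variable is read once at device init.

```cpp
#include <cassert>
#include <cstdlib>
#include <stdlib.h>
#include <string>

// Hypothetical application-side helper: opt in to the persistent pipeline
// cache by exporting GGML_VK_PIPELINE_CACHE_DIR for the current process.
static bool enable_vk_pipeline_cache(const std::string & dir) {
    if (dir.empty()) {
        return false;  // empty means "disabled", so there is nothing to set
    }
#ifdef _WIN32
    return _putenv_s("GGML_VK_PIPELINE_CACHE_DIR", dir.c_str()) == 0;
#else
    return setenv("GGML_VK_PIPELINE_CACHE_DIR", dir.c_str(), /*overwrite=*/1) == 0;
#endif
}
```

This keeps the decision about where to persist Vulkan artefacts in the application, which is the split of responsibilities argued for above.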

Test plan

Linux x86_64, Vulkan (RADV, Mesa): cold→warm chatterbox.cpp benchmark, env var set → ~91% of cold compile gap recovered. Env var unset → byte-identical to upstream timing.
Linux x86_64, no env var: behaviour byte-identical to upstream (no .pcache writes, device->pipeline_cache == VK_NULL_HANDLE, createComputePipeline takes VK_NULL_HANDLE).
macOS, MoltenVK: env var set → cache file written and re-loaded across runs.
Windows x86_64, NVIDIA: with the std::filesystem::rename fix (round 5), .pcache size grows across multiple eager flushes and the cleanup flush in the same process; without it, the second flush is silently dropped.
Android arm64, Adreno: hybrid CPU-static + Vulkan/OpenCL MODULE build links, dlopen finds libqvac-speech-ggml-vulkan.so from the flattened APK native dir via the new filename-only fallback.
OpenCL fallback: zero-device host (no Adreno) → ggml_backend_opencl_init() returns nullptr, ggml_backend_opencl_reg_device_get() returns nullptr instead of asserting; caller falls back to CPU.
vcpkg consumer build of qvac-speech-ggml: find_package(ggml CONFIG) resolves libqvac-speech-ggml.a / libqvac-speech-ggml-base.a and the MODULE GPU backends.
Existing ggml unit tests: pass with GGML_LIB_OUTPUT_PREFIX=qvac-speech- (default on this branch) and with GGML_LIB_OUTPUT_PREFIX= (explicit empty).

…e prefix

…INE_CACHE_DIR

Adds an opt-in persistent shader cache to ggml-vulkan. Enabled only when the caller sets GGML_VK_PIPELINE_CACHE_DIR to a non-empty path; when unset or empty behaviour is byte-identical to upstream ggml-vulkan.

No auto-discovery of $XDG_CACHE_HOME or $HOME. ggml is a library distributed through package managers (vcpkg) and consumed by applications that should decide whether and where to persist Vulkan artefacts. Writing to the user's home directory without being asked is a side effect library consumers cannot see from the API surface.

When enabled, createPipelineCache is seeded from the path at init and getPipelineCacheData is written back from ggml_vk_cleanup() (not ~vk_device_struct which is unreliable at process exit due to shared_ptr ref cycles). File keyed on vendorID/deviceID/driverVersion; Vulkan validates the blob header and silently ignores stale data if the shader bundle or driver changed. Atomic save via tmp+rename.

Recovers ~91% of the cold->warm shader-compile gap on the first warm run on drivers without an aggressive per-app system cache (Mesa/RADV, Android Adreno/Mali, fresh NVIDIA installs, containers).

Backport from chatterbox.cpp PR GustavoA1604/chatterbox.cpp#8 (QVAC-17872, round-1).

Co-authored-by: Cursor <cursoragent@cursor.com>

Stacks on the previous patch. Writes back the on-disk pipeline-cache blob after every ggml_vk_load_shaders compile batch instead of only at ggml_vk_cleanup() time, so a process killed mid-graph (SIGKILL, abort, OS shutdown) doesn't lose the freshly compiled pipelines.

Adds pipeline_cache_last_size book-keeping so warm runs short-circuit the disk write: the eager path only flushes when the cache actually grew (blob.size() > last_size), and the cleanup path skips when size matches last_size. This avoided a +90 ms WALL regression measured during dev when the flush was unconditional.

Backport from chatterbox.cpp PR GustavoA1604/chatterbox.cpp#8 (QVAC-17872, round-2).

Co-authored-by: Cursor <cursoragent@cursor.com>

…1-30)

Cherry-pick of 512e177 from the 2026-01-30 branch, with the lib filename prefix swapped from qvac-diffusion- to qvac-speech-:

- drop the GGML_BACKEND_DL requires BUILD_SHARED_LIBS check; static ggml core now coexists with MODULE GPU backends when GGML_CPU_STATIC=ON.
- ggml_add_backend_library skips MODULE for ggml-cpu-* when GGML_CPU_STATIC, so CPU stays in the core .a and only Vulkan/OpenCL/CUDA become .so.
- ggml/ggml-base get POSITION_INDEPENDENT_CODE=ON when GGML_BACKEND_DL is set, so MODULE backends can link the static core.
- ggml gets GGML_USE_CPU compile-define when GGML_CPU_STATIC.
- backend_filename_prefix() defaults to libqvac-speech-ggml- (matches the GGML_LIB_OUTPUT_PREFIX default on this branch).
- ggml-config.cmake.in handles the hybrid mode: exports the static CPU variant target while leaving GPU backends to ggml_backend_load_best at runtime.
- ggml_backend_opencl_init keeps the speech-branch's null-device guard (drop-clean fallback when ggml-opencl rejects all visible devices).

…orrect overwrite)

Two fixes on top of QVAC-17872 round-1/round-2:

1. Replace C `std::rename` / `std::remove` with `std::filesystem::rename` / `std::filesystem::remove` at both flush sites (cleanup-time flush in ggml_vk_save_pipeline_cache, and the eager flush in ggml_vk_load_shaders). The C runtime `rename` on Windows fails when the destination already exists (per MSDN), which meant the second-and-later saves of the .pcache blob would silently fail and the on-disk cache would never advance past its initial size on Windows. std::filesystem::rename has POSIX overwrite semantics on every platform we target.
2. Build pipeline_cache_path with std::filesystem::path joining instead of `dir + "/" + fname` string concatenation. Avoids mixed-separator surprises if a caller passes a backslash-terminated dir on Windows.

Behaviour-equivalent to round-2 on Linux/macOS; Windows now actually persists subsequent flushes instead of dropping them.

Co-authored-by: Cursor <cursoragent@cursor.com>

GustavoA1604 and others added 4 commits May 5, 2026 13:29

build: add GGML_LIB_OUTPUT_PREFIX option for per-consumer lib filenam…

4cec2d3

…e prefix

GustavoA1604 changed the title ~~Update speech branch with chatterbox vulkan optimizations~~ speech: persistent VkPipelineCache + qvac-speech-ggml hybrid packaging May 7, 2026

jpgaribotti approved these changes May 7, 2026


gianni-cor approved these changes May 7, 2026


gianni-cor merged commit 91676f0 into tetherto:speech May 7, 2026

GustavoA1604 mentioned this pull request May 7, 2026

Add tts-cpp port + bump parakeet-cpp / ggml-speech to port-version 1 tetherto/qvac-registry-vcpkg#137

Merged

10 tasks

This was referenced May 13, 2026

ggml: bump to qvac-ext-ggml#8 (Supertonic ops + Vulkan/Metal fixes) tetherto/qvac-registry-vcpkg#143

Closed

ggml: bump to qvac-ext-ggml#8 (speech HEAD 60a172e) — port-version 8 tetherto/qvac-registry-vcpkg#144

Closed

gianni-cor merged 5 commits into tetherto:speech from GustavoA1604:speech